Preliminary analysis for study one and study two.
We have now completed dual-coded data collection for the first study of this project and single-coded data collection for the second. The preregistered study protocol is available here. Below we present a preliminary analysis of the data and highlight some issues to discuss as a team.
In November 2019, we made copies of any documentation (e.g., instructions to authors) containing author-facing statistical guidance provided by the top 15 journals (ranked by Impact Factor) in each of 22 scientific disciplines (i.e., a total of N = 330 journals).
Operational definition of statistical guidance
Any advice or instruction related to the appropriate selection, implementation, reporting, or interpretation of statistical analyses.
We then examined this documentation and recorded whether each journal:
If a journal provided any statistical guidance:
All data extraction was performed by a first coder (DS, MM, TB, MSH) and a second coder (TEH) with any coding differences resolved through discussion.
Note: When journals referred to guidance in external sources (typically reporting guidelines), we have recorded the names of the sources, but we have not extracted guidance from those sources.
Note: We found that some publishers provided statistical guidance that was shared across their portfolio of journals (specifically, 31 Nature journals, 12 Cell journals, and 2 Frontiers journals). An additional journal, Scientific Data, shares guidance with other Nature journals but also has its own specific guidance. Shared publisher guidance was coded by two team members (as above), but is represented in our data and analysis multiple times: once for each individual journal it applies to. For example, the 12 journals published by Cell inherit Cell's "STAR Methods" guidance, so this guidance appears 12 times in our dataset.
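To make the duplication concrete, here is a minimal sketch of how the shared publisher guidance expands into journal-level rows. The publisher names and journal counts come from the note above; the dict structure and variable names are purely illustrative, not our actual analysis code.

```python
# Journals inheriting shared publisher-level guidance (counts from the text above).
shared_guidance = {
    "Nature": 31,    # journals inheriting Nature [LSR] guidance
    "Cell": 12,      # journals inheriting Cell's "STAR Methods" guidance
    "Frontiers": 2,  # journals sharing Frontiers guidance
}

# Each shared-guidance document appears once per journal it applies to,
# so three publisher-level documents contribute this many journal-level rows:
duplicated_rows = sum(shared_guidance.values())
print(duplicated_rows)  # 45
```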
For study two, we planned to more closely examine the guidance provided on five specific topics:
We prespecified some information we wanted to extract about these topics; however, when I (Tom) examined the verbatim guidance, I found our prespecified questions and response options to be unsuitable (ambiguous, badly worded, etc.). So, in an entirely data-dependent and exploratory manner, I instead coded the level of endorsement for each topic/method according to four categories:
This was admittedly quite subjective and I had to ‘read between the lines’ a bit in order to assign guidance to these categories.
I also categorised reporting advice for three topics, as follows:
p-values:
Statistical significance:
Confidence intervals:
So far, the data coding for study two has only been performed by me (Tom), though we originally planned for dual-coding (see issues section at the end of this document).
Figure 1 shows the percentage of journals in each scientific discipline (denominator is N = 15) and overall (denominator is N = 330) that:
For tabular data, see the table in the Appendix.
You can see that around half (48%) of journals offered any statistical guidance. Note that this includes 32 journals that only referred to statistical guidance in external sources (reporting guidelines or academic papers). Just over a quarter of journals (28%) had a dedicated statistical guidance section in their author instructions. In two fields (Computer Science and Maths), no journals offered any statistical guidance. Journals in health-related fields were more likely to offer statistical guidance and to have dedicated statistical guidance sections. Notably, 100% of the journals in clinical medicine offered some statistical guidance.
Figure 1: Percentage of journals offering statistical guidance by scientific discipline (N = 15; represented by coloured dots) and overall (N = 330; represented by black diamond).
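As a quick arithmetic check on the denominators behind Figure 1, the sketch below recomputes the overall "any guidance" percentage from counts reported elsewhere in this document (128 journals with internal guidance plus 32 external-source-only journals). This is a sanity check, not our analysis code.

```python
# Denominators: 22 disciplines x top 15 journals each.
n_disciplines = 22
n_per_discipline = 15
n_total = n_disciplines * n_per_discipline  # 330 journals overall

# Overall "any guidance" figure: internal guidance plus external-only referrals
# (both counts are reported in the text of this document).
n_any_guidance = 128 + 32
pct_any = 100 * n_any_guidance / n_total
print(round(pct_any))  # 48
```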
Figure 2 shows the percentage of journals in each scientific discipline (denominator is N = 15) and overall (denominator is N = 330) that mentioned each of the twenty statistical topics we pre-selected. For tabular data, see the table in the Appendix. Please see the issues section at the end of this document for an important note regarding the topic “prespecification of analyses”.
You can see that some topics (e.g., confidence intervals, p-values) are much more likely to be mentioned than other topics (e.g., handling outliers, categorisation of continuous data). Journals in health-related fields, particularly clinical medicine, are often more likely to mention this collection of topics than journals in other fields.
Figure 2: Percentage of journals offering guidance on twenty statistical topics by scientific discipline (N = 15; represented by coloured dots) and overall (N = 330; represented by black diamond). For presentational purposes, the two scientific disciplines in which no journals offered any statistical guidance are not shown. Additionally, disciplines offering no guidance on individual topics are shown, but not labelled. Graphs are ordered from left to right and top to bottom by the overall proportion of journals offering guidance on each topic.
For the 128 journals that offered some internal guidance (i.e., excluding those that offered no guidance at all and those that only referred to external sources), the histogram in Figure 3 illustrates that the maximum number of topics mentioned by an individual journal was 15. There were 9 journals that did not mention any of our prespecified topics (but provided statistical guidance on other topics). The median number of topics mentioned was 6.
Figure 3: Histogram showing how many of the twenty statistical topics were mentioned by each journal
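For anyone reproducing the Figure 3 summaries, here is a minimal sketch of the statistics involved, using a made-up topic-count list (the real per-journal counts live in our dataset, so the list below is hypothetical and only the maximum is chosen to match an observed value).

```python
from statistics import median

# Hypothetical per-journal counts of prespecified topics mentioned (10 journals).
topic_counts = [0, 0, 3, 5, 6, 6, 7, 9, 12, 15]

print(max(topic_counts))                  # 15 (matches the observed maximum)
print(median(topic_counts))               # 6.0
print(sum(c == 0 for c in topic_counts))  # 2 journals mentioning no preselected topic
```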
In the grid below (Figure 4), you can see which of the twenty preselected topics were mentioned by each journal. Journals that shared publisher guidance are represented by one row (per publisher). Journals that did not offer any statistical guidance at all, or journals that referred only to guidance in external sources, are not shown.
Figure 4: Grid diagram showing whether each journal provided guidance on each of twenty preselected statistical topics. Topics are ordered from most (left) to least (right) mentioned. Journals are ordered from most (top) to least (bottom) preselected topics mentioned.
To give you a flavour of the kind of guidance provided, I have randomly selected two examples for each of the twenty statistical topics (excluding a few very long examples) and displayed them in Table 1 below.
| topic | example guidance |
|---|---|
| baseline covariates | Statistical tests, along with reported P values, for comparing groups at baseline are not necessary unless there is a strong reason to include them. (JOURNAL OF ALLERGY AND CLINICAL IMMUNOLOGY) |
| baseline covariates | For randomized trials using parallel-group design, there is no validity in conducting hypothesis tests regarding the distribution of baseline covariates between groups; by definition, these differences are due to chance. Because of this, tables of baseline participant characteristics should not include P values or statements of statistical comparisons among randomized groups. Instead, report clinically meaningful imbalances between groups, along with potential adjustments for those imbalances in multivariable models (JAMA Neurology) |
| Bayesian statistics | Authors are expected to apply the most appropriate statistical tools for data analysis, and it is acceptable to present results from frequentist, information-theory, and Bayesian approaches in the same manuscript. (BRAIN) |
| Bayesian statistics | Inherited Nature [LSR] guidance: For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings (Scientific Data) |
| categorisation of continuous data | Categorizing of continuous data (e.g. into quartiles, quintiles) is discouraged. It leads to a loss of information, usually needs more complicated methods than for continuous data and introduces demarcations which are valid only for this particular study. (EUROPEAN HEART JOURNAL) |
| categorisation of continuous data | If the data are appropriate, age grouping should be mid-decade to mid-decade or in five-year age groups (e.g. 35-44 or 35-39, 40-44, etc, but not 20-29, 30-39 or other groupings. (INTERNATIONAL JOURNAL OF EPIDEMIOLOGY) |
| confidence intervals | Descriptive statistics should include a clearly labelled measure of centre (such as the mean or the median), and a clearly labelled measure of variability (such as standard deviation or range). Ranges are more appropriate than standard deviations or standard errors for small data sets. Standard error or confidence interval is appropriate to compare data to a control. (EMBO JOURNAL) |
| confidence intervals | For intervention studies, the abstract should include the primary outcome expressed as the difference between groups with a confidence interval on that difference (absolute differences are more useful than relative ones). Secondary outcomes can be included as long as they are clearly marked as secondary and all such outcomes are reported. (LANCET ONCOLOGY) |
| effect sizes | SMJ also requires in papers accepted for publication that authors explicitly discuss and interpret effect sizes of relevant estimated coefficients…In addition, authors of submitted papers should address the material significance (magnitude) of the results, in addition to statistical significance. (STRATEGIC MANAGEMENT JOURNAL) |
| effect sizes | P values alone are not sufficient to report the results of statistical tests. The JACI’S readers need to see the magnitude of the effects via point estimates and 95% confidence intervals for the group comparisons…[Report] For each primary and secondary outcome , a summary of results for each group, and the estimated effect size and its precision (e.g., 95% confidence interval). (JOURNAL OF ALLERGY AND CLINICAL IMMUNOLOGY) |
| data exclusions | Data pre-processing steps such as transformations, re-coding, re-scaling, normalization, truncation, and handling of below detectable level readings and outliers should be fully described; any removal or modification of data values must be fully acknowledged and justified. (SCIENCE) |
| data exclusions | Each manuscript should clearly state …patients or participants with inclusion and exclusion criteria (JAMA Neurology) |
| handling missing data | Report losses to observation, such as dropouts from a clinical trial or those lost to follow-up or unavailable in an observational study. Consider multiple imputation methods to impute missing data and include an assessment of whether data were missing at random. Approaches based on “last observation carried forward” should not be used (JAMA Internal Medicine) |
| handling missing data | Clearly indicate the number of observations for each analysis or experiment, along with information about missing data. (ENVIRONMENTAL HEALTH PERSPECTIVES) |
| checking model assumptions | It is important that the author be satisfied that the assumptions behind any statistical analysis are sufficiently met and that, at least where unusual assumptions are made, unusual procedures are used, or unusual types of data are involved, and that the reader be provided with sufficient information to judge whether any departures from assumptions are severe enough to vitiate the conclusions. The amount of detail provided in any particular instance will depend on the centrality of the statistical test to the conclusions. (ECOLOGICAL MONOGRAPHS) |
| checking model assumptions | For multivariable models, report all variables included in models, and report model diagnostics and overall fit of the model when available (JAMA Neurology) |
| multiple comparisons | Reports of studies with multiple or secondary end points should address the multiple comparison issues and describe the exploratory nature of the studies. (JOURNAL OF CLINICAL ONCOLOGY) |
| multiple comparisons | Selection of endpoints. Were the primary and secondary endpoints prospectively selected? If multiple endpoints were assessed, were the appropriate statistical corrections applied?…Adjustments made to alpha levels (e.g., Bonferroni correction) or other procedure used to account for multiple testing (e.g., false discovery rate control) should be reported….more than two significant digits on p-values are usually not needed except in situations of extreme multiple testing such as in genetic association studies where stringent corrections for multiple testing might be used. (Science Translational Medicine) |
| non-parametric tests | For statistical analysis of small sample sizes, authors should use tests suitable for small sizes and provide a justification for the test used. (Cancer Discovery) |
| non-parametric tests | Normal distribution: Many statistical tests require that the data be approximately normally distributed; when using these tests, authors should explain how they tested their data for normality. If the data do not meet the assumptions of the test, then a non-parametric alternative should be used instead. (Scientific Reports) |
| null hypotheses | For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted [LSR] (Nature Plants) |
| null hypotheses | Inherited Nature [LSR] guidance: For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted (Scientific Data) |
| one-sided tests | State any a priori levels of significance and whether hypothesis tests were 1- or 2-sided. (JAMA Oncology) |
| one-sided tests | State whether tests were one- or two-tailed. (JOURNAL OF THE AMERICAN ACADEMY OF CHILD AND ADOLESCENT PSYCHIATRY) |
| handling outliers | If you encountered any outliers, you should describe how these were handled (eLife) |
| handling outliers | Outliers. How were outliers defined and handled? Were they defined before the beginning of the study? Have you reported outliers that were excluded?…Data pre-processing steps such as transformations, re-coding, re-scaling, normalization, truncation, and handling of below detectable level readings and outliers should be fully described; any removal or modification of data values must be fully acknowledged and justified. (Science Translational Medicine) |
| p values | Quantitative studies: abstracts should provide effect sizes with confidence intervals (not P-values alone). (BRITISH JOURNAL OF PSYCHIATRY) |
| p values | If p-values are presented, one-sided or two-sided should be specified. If one-sided, justification should be provided. Confidence intervals should also accompany the parameter for which statistical significance is being tested. (NUCLEIC ACIDS RESEARCH) |
| prespecification of analyses | All clinical trials that have begun randomization must be registered at an appropriate online public registry (see Trial Registration requirements)…Both randomized and observational studies should identify the primary outcome(s) before the study began, as well as any prespecified secondary, subgroup, and/or sensitivity analyses. Comparisons arrived at during the course of the analysis or after the study was completed should be identified as post hoc…Analyses should follow EQUATOR Reporting Guidelines and be consistent with the protocol and statistical analysis plan, or described as post hoc. (JAMA Internal Medicine) |
| prespecification of analyses | Authors are required to pre-register clinical trials with an international clinical trials register or and to cite a reference to the registration in the Methods section. Suitable databases include clinicaltrials.gov, the EU Clinical Trials Register and those listed by the World Health Organisation International Clinical Trials Registry Platform. (Nutrients) |
| sample size justification | Sample Size Calculations. For randomized trials, a statement of the power or sample size calculation is required (see the EQUATOR Network CONSORT Guidelines). For observational studies that use an established population, a power calculation is not generally required when the sample size is fixed. However, if the sample size was determined by the researchers, through any type of sampling or matching, then there should be some justification for the number sampled. In any case, describe power and sample size calculations at the beginning of the Statistical Methods section, following the general description of the study population (JAMA Neurology) |
| sample size justification | QUANTIFICATION AND STATISTICAL ANALYSIS…please summarize in this section…sample size estimation… Is there information related to experimental design? …Sample size estimation and statistical method of computation (Cell Metabolism) |
| secondary outcomes | Selection of endpoints. Were the primary and secondary endpoints prospectively selected? If multiple endpoints were assessed, were the appropriate statistical corrections applied? (Science Translational Medicine) |
| secondary outcomes | Since multiple statistical testing methods are frequently used in genotyping-phenotyping studies, please include specifics of the primary model(s) tested. Non-essential secondary models may be published as electronic data supplements. (CIRCULATION) |
| sensitivity analyses | Meta-analyses should state the major outcomes that were pooled and include odds ratios or effect sizes and, if possible, sensitivity analyses. Both randomized and observational studies should identify the primary outcome(s) before the study began, as well as any prespecified secondary, subgroup, and/or sensitivity analyses. (JAMA Internal Medicine) |
| sensitivity analyses | Meta-analyses should state the major outcomes that were pooled and include odds ratios or effect sizes and, if possible, sensitivity analyses…Both randomized and observational studies should identify the primary outcome(s) before the study began, as well as any prespecified secondary, subgroup, and/or sensitivity analyses. (JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION) |
| statistical significance | Define statistical terms, abbreviations, and symbols, if included. Avoid nontechnical uses of technical terms in statistics, such as correlation, normal, predictor, random, sample, significant, trend. Do not use inappropriate hedge terms such as marginal significance or trend toward significance for results that are not statistically significant…State any a priori levels of significance and whether hypothesis tests were 1- or 2-sided…Report basic numbers only but state if results are statistically significant or not significant; (JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION) |
| statistical significance | If necessary, present results of tests of significance, such as analysis of variance, in addition to tests of variability. (NEW PHYTOLOGIST) |
| subgroup analyses | Separate reporting of data by demographic variables, such as age and sex, facilitates pooling of data for subgroups across studies and should be routine, unless inappropriate. Discuss the influence or association of variables, such as sex and/or gender, on your findings, where appropriate, and the limitations of the data. (LANCET ONCOLOGY) |
| subgroup analyses | 8. Meta-analysis… State whether you identified potential sources of heterogeneity prior to initiating your review and analyses, and describe whether and how you carried out subgroup or sensitivity analyses or meta-regression to explore that heterogeneity. (ANNALS OF INTERNAL MEDICINE) |
137 journals referred authors to statistical guidance in an external source (reporting guidelines or other sources like academic papers or websites). Of these, 32 only referred to statistical guidance in external sources, whereas the other 105 also provided their own guidance in addition to referring to external sources.
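The external-referral counts above partition cleanly, which can be verified with a one-line check (numbers taken directly from the text; illustrative only):

```python
# Journals referring to any external source split into two disjoint groups.
n_referred_external = 137  # any external referral
n_external_only = 32       # referred to external sources only
n_both = 105               # also provided their own internal guidance

assert n_external_only + n_both == n_referred_external
print(n_external_only + n_both)  # 137
```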
In the histogram (Figure 5) below, you can see how frequently journals referred to different reporting guidelines. Then, in Table 2, you can see all of the other external sources (academic papers and websites) that journals referred to. We have not examined any of these external sources in detail.
Figure 5: Histogram showing how frequently journals referred to different reporting guidelines.
Other external sources (not reporting guidelines):
| external_guidance | n_journals |
|---|---|
| cummings & rivara 2003 (10.1001/archpedi.157.4.321) | 3 |
| cumming et al. 2007 (10.1083/jcb.200611141) | 2 |
| olsen 2003 (10.1128/iai.71.12.6689-6692.2003) | 2 |
| olsen 2014 (10.1128/iai.00811-13) | 2 |
| richardson & overbaugh 2005 (10.1128/jvi.79.2.669-676.2005) | 2 |
| altman et al. 1983 (10.1136/bmj.286.6376.1489) | 1 |
| apa manual | 1 |
| asa (https://www.amstat.org/asa/your-career/ethical-guidelines-for-statistical-practice.aspx) | 1 |
| boushey et al. 2006 (10.1016/j.jada.2005.11.007) | 1 |
| boushey et al. 2008 (10.1016/j.jada.2008.01.002) | 1 |
| bruemmer et al. 2009 (10.1016/j.jada.2009.07.011) | 1 |
| gleason et al. 2010 (10.1016/j.jada.2009.11.022) | 1 |
| gleason et al. 2015 (10.1016/j.jand.2015.03.011) | 1 |
| harris & raynor 2017 (10.1016/j.jand.2017.03.017) | 1 |
| harris et al. 2008 (10.1016/j.jada.2008.06.426) | 1 |
| harris et al. 2009 (10.1016/j.jada.2008.10.018) | 1 |
| harris et al. 2012 (10.1016/j.jada.2011.09.037) | 1 |
| hewitt 2012 (10.1007/s10519-011-9504-z) | 1 |
| hollingshead 2008 (10.1093/jnci/djn351) | 1 |
| http://www.biostathandbook.com/ | 1 |
| http://www.utdallas.edu/~serfling/3332/biology_statistics_made_simple_using_excel.pdf | 1 |
| https://www.ncbi.nlm.nih.gov/books/nbk153593/ | 1 |
| kempen 2011 (10.1016/j.ajo.2010.08.047) | 1 |
| motulsky 2014 (10.1124/jpet.114.219170) | 1 |
| poldrack et al. 2008 (10.1016/j.neuroimage.2007.11.048) | 1 |
| sheean 2011 (10.1016/j.jada.2010.10.010) | 1 |
| simon et al. 2009 (10.1093/jnci/djp335) | 1 |
| sullivan et al. 2016 (10.1161/jaha.116.004142) | 1 |
| table 3 in dupuy & simon 2007 (10.1093/jnci/djk018) | 1 |
| zoellner & harris 2017 (10.1016/j.jand.2017.01.018) | 1 |
| zoellner et al. 2015 (10.1016/j.jand.2015.03.010) | 1 |
Study two involved a closer look at five preselected statistical topics. Table 3 shows how many journals offered different levels of endorsement for these statistical methods. You can see that it was fairly rare to oppose the use of any of these methods, although a few journals are taking a stand against the use of "statistical significance". Few journals strongly advocate the use of p-values, but the vast majority find them acceptable. It was quite common to recommend reporting of effect sizes and confidence intervals.
| topic | Explicit endorsement | Implicit endorsement | Implicit opposition | Explicit opposition |
|---|---|---|---|---|
| confidence intervals | 85 | 4 | 1 | 0 |
| effect sizes | 62 | 4 | 0 | 0 |
| p-values | 10 | 77 | 0 | 1 |
| sample size justification | 67 | 5 | 0 | 0 |
| statistical significance | 9 | 35 | 5 | 3 |
Note that the one journal with implicit opposition to confidence intervals said this applied only in the case of small sample sizes, and the one journal with explicit opposition to p-values said this applied only when there were no pre-specified multiplicity corrections.
Of the 88 journals offering guidance on p-values, 52 advised reporting of exact p-values, 21 advised reporting of exact p-values unless they were very small (e.g., < .001), 5 advised to use reporting thresholds (e.g., <.05, <.01, <.001), and 10 offered no guidance on how to report p-values.
Of the 53 journals offering guidance on statistical significance, 30 advised authors report the alpha level they used, 2 advised using an alpha level of .05, 1 advised using an alpha level of .01, and 20 offered no guidance on reporting statistical significance.
Of the 90 journals offering guidance on confidence intervals, 16 advised authors to report 95% confidence intervals, 5 advised authors report their chosen confidence level, and 69 offered no guidance on reporting confidence intervals.
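The reporting-advice breakdowns in the three paragraphs above should each sum to their stated denominators. The sketch below tallies them as a sanity check (counts taken directly from the text; category labels are shorthand, not our coding scheme's exact wording):

```python
# Reporting-advice breakdowns, as stated in the text above.
p_values = {"exact": 52, "exact unless very small": 21, "thresholds": 5, "no advice": 10}
significance = {"report alpha": 30, "alpha .05": 2, "alpha .01": 1, "no advice": 20}
conf_intervals = {"report 95% CI": 16, "report chosen level": 5, "no advice": 69}

print(sum(p_values.values()))        # 88 journals with p-value guidance
print(sum(significance.values()))    # 53 journals with significance guidance
print(sum(conf_intervals.values()))  # 90 journals with confidence-interval guidance
```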
In two fields (Computer Science and Maths), no journals offered any statistical guidance. I do not know these fields well, but I wonder if this indicates that statistical tools are rarely used, in which case one might reasonably suggest we should not even have bothered to examine these fields. Any thoughts on this?
Our coding for the topic “prespecification of analyses” was not consistent across coders. Specifically, some coders considered that this topic also covered any guidance/instructions related to registration of clinical trials, whilst others did not. I believe the justification of those who did not was that clinical trials registration does not necessarily involve prespecification of a statistical analysis plan. That seems a reasonable justification to me, but I’d be interested to hear thoughts from the team. If we decide that advice about clinical trials registration alone does not fall into the remit of this topic, then it is relatively straightforward to examine the verbatim text we extracted and re-code the relevant cases. However, if we decide that advice about clinical trials registration does fall under the remit of this topic, then we will need to re-visit the journal instructions to authors and search for such guidance (as this was not done consistently previously).
We have recorded whether journals refer to statistical guidance from external sources, such as reporting guidelines or academic papers. However, we have not examined those external sources, and we need to decide whether to do so. Performing extraction and coding for all 81 external sources we have identified would clearly be a lot of extra work. A counter-argument is that we are only really interested in the most salient statistical guidance offered directly by journals. On the other hand, if we do not extract information from external sources, then perhaps we are missing an important aspect of statistical guidance, i.e., that journals may not provide their own guidance on certain topics because these are aptly covered by external sources. An additional complication is that in some cases the boundary between internal and external guidance is blurred, as when the 'external' source is an academic article that appears to have been written by members of the journal's editorial team.
In the protocol we say that for Study 2, data extraction and coding will be performed in duplicate. However, we caveat this by saying it might be necessary to deviate from the plan depending on our resources. I have done all of the first coding; do we have the capacity to do duplicate coding? My worry is that it will slow us down considerably, and I am keen to get these results published before they are outdated (the statistical guidance was originally extracted in November 2019).
Tabular data (represented in Figures 1 and 2). Number and percentage of journals overall and by scientific field providing internal statistical guidance by topic (click the black arrow to scroll through the data for each field). The denominator for proportions is the number of journals in the field (N = 15) or overall (N = 330).